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Abstract 

In this paper we consider sparse approximation problems, that is, general Iq minimization prob- 
lems with the Zo-"norm" of a vector being a part of constraints or objective function. In particular, 
we first study the first-order optimality conditions for these problems. We then propose penalty de- 
composition (PD) methods for solving them in which a sequence of penalty subproblems are solved 
by a block coordinate descent (BCD) method. Under some suitable assumptions, we establish that 
any accumulation point of the sequence generated by the PD methods satisfies the first-order opti- 
mality conditions of the problems. Furthermore, for the problems in which the Iq part is the only 
nonconvex part, we show that such an accumulation point is a local minimizer of the problems. In 
addition, we show that any accumulation point of the sequence generated by the BCD method is a 
saddle point of the penalty subproblem. Moreover, for the problems in which the Iq part is the only 
nonconvex part, we establish that such an accumulation point is a local minimizer of the penalty 
subproblem. Finally, we test the performance of our PD methods by applying them to sparse logistic 
regression, sparse inverse covariance selection, and compressed sensing problems. The computational 
results demonstrate that our methods generally outperform the existing methods in terms of solution 
quality and/or speed. 

Key words: Iq minimization, penalty decomposition methods, block coordinate descent method, 
compressed sensing, sparse logistic regression, sparse inverse covariance selection 

1 Introduction 

Nowadays, there are numerous applications in which sparse solutions are concerned. For example, in 
compressed sensing, a large sparse signal is decoded by using a relatively small number of linear mea- 
surements, which can be formulated as finding a sparse solution to a system of linear equalities and/or 
inequalities. The similar ideas have also been widely used in linear regression. Recently, sparse inverse 
covariance selection becomes an important tool in discovering the conditional independence in graph- 
ical models. One popular approach for sparse inverse covariance selection is to find an approximate 
sparse inverse covariance while maximizing the log- likelihood (see, for example, [E]). Similarly, sparse 
logistic regression has been proposed as a promising method for feature selection in classification prob- 
lems in which a sparse solution is sought to minimize the average logistic loss (see, for example, 
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Mathematically, all these applications can be formulated into the following Iq minimization problems: 



min{/(x) : g(x) < 0, h(x) = 0, ||xj||o < r} 
mm{f(x) + ^||xj|| : g(x) < 0, h(x) = 0} 



(1) 
(2) 



for some integer r > and v > controlling the sparsity of the solution, where X is a closed convex 
set in the n-dimensional Euclidean space 3? n , / : 3? n — > K, g : 3? ra — > W 1 and h : 3? n — > W are 
continuously differentiable functions, and ||xj||o denotes the cardinality of the subvector formed by the 
entries of x indexed by J. Some algorithms are proposed for solving special cases of these problems. For 
example, the iterative hard thresholding algorithms |26[ [SJ E] and matching pursuit algorithms [381 152"] 
are developed for solving the Zo-regularized least squares problems arising in compressed sensing, but 
they cannot be applied to the general Iq minimization problems (pQ) and ([2]). In the literature, one 
popular approach for dealing with dU) and ([2]) is to replace || • ||o by the Zi-norm || • ||i and solve the 
resulting relaxation problems instead (see, for example, [HI EH QUI EI] ) • For some applications such 
as compressed sensing, it has been shown in [8] that under some suitable assumptions this approach 
is capable of solving (pQ) and ([2]). Recently, another relaxation approach has been proposed to solve 
problems ([I]) and ([2]) in which || • ||q is replaced by Z p -"norm" || • || p for some p G (0, 1) (see, for example, 
[91 [TTl [12]). In general, it is not clear about the solution quality of these approaches. Indeed, for the 
example given in the Appendix, the L relaxation approach for p E (0, 1] fails to recover the sparse 
solution. 

In this paper we propose penalty decomposition (PD) methods for solving problems ([I]) and ([2]) 
in which a sequence of penalty subproblems are solved by a block coordinate descent (BCD) method. 
Under some suitable assumptions, we establish that any accumulation point of the sequence generated 
by the PD method satisfies the first-order optimality conditions of ([1]) and ([2]). Furthermore, when /i's 
are affine, and / and g's are convex, we show that such an accumulation point is a local minimizer 
of the problems. In addition, we show that any accumulation point of the sequence generated by 
the BCD method is a saddle point of the penalty subproblem. Moreover, when /i's are affine, and / 
and g's are convex, we establish that such an accumulation point is a local minimizer of the penalty 
subproblem. Finally, we test the performance of our PD methods by applying them to sparse logistic 
regression, sparse inverse covariance selection, and compressed sensing problems. The computational 
results demonstrate that our methods generally outperform the existing methods in terms of solution 
quality and/or speed. 

The rest of this paper is organized as follows. In Subsection 11.11 we introduce the notation that is 
used throughout the paper. In Section [21 we establish the first-order optimality conditions for general Iq 
minimization problems. In Section [31 we study a class of special Iq minimization problems. We develop 
the PD methods for general Iq minimization problems in Section[4]and establish some convergence results 
for them. In Section [51 we conduct numerical experiments to test the performance of our PD methods for 
solving sparse logistic regression, sparse inverse covariance selection, and compressed sensing problems. 
Finally, we present some concluding remarks in section [6l 

1.1 Notation 

In this paper, the symbols K n and 3ft™ denote the n-dimensional Euclidean space and the nonnegative 
orthant of 3ft n , respectively. Given a vector v G 3ft™, the nonnegative part of v is denoted by v + = 
max(v, 0), where the maximization operates entry-wise. For any real vector, || • ||o and || • || denote the 
cardinality (i.e., the number of nonzero entries) and the Euclidean norm of the vector, respectively. 
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Given an index set L C {1, . . . , n}, \L\ denotes the size of L, and the elements of L are denoted by 
L(l), . . . , L(|L|), which are always arranged in ascending order, xl denotes the subvector formed by the 
entries of x indexed by L. Likewise, Xl denotes the submatrix formed by the columns of X indexed by L. 
In addition, For any two sets A and B, the subtraction of A and B is given by A\B = {x E A : x ^ B}. 
Given a closed set C C !R Tl , let Mc{x) and Tc{x) denote the normal and tangent cones of C at any 
x E C, respectively. The space of all m x n matrices with real entries is denoted by $t mxn , and the 
space of symmetric n x n matrices is be denoted by S n . We denote by / the identity matrix, whose 
dimension should be clear from the context. If X E S n is positive semidefinite (resp., definite), we write 
X y (resp., X >~ 0). The cone of positive semidefinite (resp., definite) matrices is denoted by 5" 
(resp., <5?_i_). ^ is an operator which maps a vector to a diagonal matrix whose diagonal consists of the 
vector. Given an n x n matrix X, 3>(X) denotes a diagonal matrix whose ith diagonal element is Xa 
for £ = 1, ... ,n. 



2 First-order optimality conditions 

In this section we study the first-order optimality conditions for problems ([T|) and ([2]). In particular, 
we first discuss the first-order necessary conditions for them. Then we study the first-order sufficient 
conditions for them when the Iq part is the only nonconvex part. 

We now establish the first-order necessary optimality conditions for problems (PQ) and (J2|). 

Theorem 2.1 Assume that x* is a local minimizer of problem (CJ). Let J* C J be an index set with 
| J* | = r such that x* = for all j G J*, where J* = J \ J*. Suppose that the following Robinson 
condition 



g'(x*)d — v 
h'{x*)d 



: d e Tx(x*),v E K m , Vi < 0, i e A{x*) } = 5R m x W x ftl J \~ r (3) 



holds, where g'(x*) and h'(x*) denote the Jacobian of the functions g = (gi, . . . , g m ) and h = (hi, . . . ,h p ) 
at x* , respectively, and 

A(x*) = {l<i<m: gi (x*) = 0}. (4) 
Then, there exists (A* , fi* , z*) E 5R m x 5i p x 5i n together with x* satisfying 

-V/(x*) - Vg(x*)X* - Vh(x*)n* - z* E M x {x*), 
A* > 0, \*gi{x*) = 0, i = 1, . . . ,m; z] = 0, j E JU J*. 

where J is the complement of J in {1, . . . ,n}. 



(5) 



Proof. By the assumption that x* is a local minimizer of problem ([T]), one can observe that x* is 
also a local minimizer of the following problem: 

min{/(x) : g(x) < 0, h(x) = 0, xj* = 0}. (6) 

Using this observation, ([3]) and Theorem 3.25 of [47], we see that the conclusion holds. ■ 
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Theorem 2.2 Assume that x* is a local minimizer of problem Let J* = {j G J : Xj ^ 0} and 
J* = J \ J* . suppose that the following Robinson condition 



g'{x*)d - v 
h'(x*)d 
(Ij*) T d 



: d G Tx(x*), v G 5R m , v t < 0, i G A{x*) =rxjf p x 3?I J *I (7) 



ZioWs, w/iere .A(a;*) is defined in fH). T/ien, there exists (X*,fi*,z*) G 5fi m x W x Ji n together with x* 
satisfying ([5]). 

Proof. It is not hard to observe that x* is a local minimizer of problem ([2]) if and only if x* is a 
local minimizer of problem ([6]). Using this observation, (|7|) and Theorem 3.25 of [47], we see that the 
conclusion holds. ■ 

We next establish the first-order sufficient optimality conditions for problems (pQ) and ([2|) when the 
Iq part is the only nonconvex part. 

Theorem 2.3 Assume that h's are affine functions, and f and g's are convex functions. Let x* be a 
feasible point of problem HP, and let J* = {J* C J : \J*\ = r, x* = 0, Vj G J\ J*}. Suppose that for 
any J* G J* , there exists some (A* , /i* , z*) G 5R m x W x K n sitc/i i/iai ([5]) holds. Then, x* is a local 
minimizer of problem ([T]j. 



Proof. It follows from the above assumptions and Theorem 3.34 of [47] that x* is a minimizer of 
problem ([6]) for all J* G { J \ J* : J* G J"*}. Hence, there exists e > such that f(x) > f(x*) for all 
x G Uj*(zj*Oj*(x*; e), where 

Oj^x^e) = G : #(x) < 0, = 0, xj* = 0, ||s - x*\\ < e} 

with J* = J\J*. One can observe from ([T]) that for any x G C(x*; e), where 

0(x*; e) = {x G X : g{x) < 0, h(x) = 0, ||xj|| < r, \\x - x*\\ < e}, 

there exists J* G J* such that a; G Oj*(x*;e) and hence /(x) > f(x*). It implies that the conclusion 
holds. ■ 



Theorem 2.4 Assume that h's are affine functions, and f and g's are convex functions. Let x* be a 
feasible point of problem |IjJ ; and let J* = {j G J : x* ^ 0}. Suppose that for such J*, there exists some 
(A* , fi* , z*) G W 1 xPx W 1 such that ([5]) holds. Then, x* is a local minimizer of problem (dp. 

Proof. By virtue of the above assumptions and Theorem 3.34 of [37], we know that x* is a minimizer 
of problem ([6]) with J* = J \J* . Also, we observe that any point is a local minimizer of problem ([2]) if 
and only if it is a local minimizer of problem ([6]). It then implies that x* is a local minimizer of ([2|). ■ 

Remark. The second-order necessary or sufficient optimality conditions for problems ([1]) and ([2|) can 
be similarly established as above. ■ 
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3 A class of special Iq minimization 

In this section we show that a class of special Iq minimization problems have closed-form solutions, which 
can be used to develop penalty decomposition methods for solving general Iq minimization problems. 

Proposition 3.1 Let Xi C 3fJ and fa : M — > M for i = 1, . . . ,n be given. Suppose that r is a positive 
integer and G X% for all i. Consider the following Iq minimization problem: 



mm 



^2 M x i) : [kilo < r, x G X x x • • • x X n \. (8) 
I i=\ ) 

Let x* G Argmin{^j(xj) : Xj G X{\ and L* C {1, . . . , n} be the index set corresponding to r largest values 
°f i v t}?=i' where v* = fa(0) — fa{x*) for i = 1, . . . ,n. Then, x* is an optimal solution of problem {SJ), 
where x* is defined as follows: 

„ / x* ifi G /*; 
x i = \ n ^ ■ i = l,...,n. 

[ (J otherwise, 

Proof. By the assumption that G A^j for all i, and the definitions of x* , x* and J*, we see that 
x* E Xi X ■ ■ ■ X X n and ||x*||o < r - Hence, x* is a feasible solution of ([8]). It remains to show that 
<j)(x) > <j)(x*) for any feasible point a: of (JH])- Indeed, let x be arbitrarily chosen such that [|x[|o < v 
and x G X\ x ■ ■ ■ x X n , and let L = {i : Xi ^ 0}. Clearly, \L\ < r = \L*\. Let I* and 7 denote the 
complement of I* and 7 in {1, . . . , n}, respectively. It then follows that 

\lni*\ = |/*| — |7* n z,| > |7|-|7*n7| = 1 7 n 7 * | . 

In view of the definitions of x* , x*, I*, I*, L and 7, we further have 

<f>{x) - <f>(x*) = EieLnAM x i)-M x t)) + EieLnT*(M x i)-M x t)) 

+ EieLnAM x i) ~ M x *)) + EieLnl*(M x i) " M x t)), 

= T,ieLni*(M x i) ~ M X D) + Eie£ni*(^(°) ~ M°)) 

+ Ei 6 InJ.(^i(0) " M x *)) + EieLnI*(M x i) " MO)), 

> EieLnAMO) ~ M x *)) + Eiemi'iMZi) ~ MO)), 

= E l eLnl*(M0) ~ M x *)) -E^Lnl^MO) ~ M x *)) > 0, 

where the last inequality follows from the definition of I* and the relation \Lf] I*\ > |7n I*\. Thus, we 
see that <fi(x) > <j>(x*) for any feasible point x of ([8]), which implies that the conclusion holds. ■ 

It is straightforward to establish the following result. 

Proposition 3.2 Let Xi C 3ft and fa : K — > for i = 1, . . . , n 6e given. Suppose that v > and G Afj 

/or a// i. Consider the following Iq minimization problem: 

min < ^||x||o + fa{xi) '■ x & Xi x ■ ■ ■ x X n > . (9) 

-Let x* G Argmin{0j(xj) : Xj G A^} and v* = fa(0) — v — fa(x*) for i = 1, . . . , n. Then, x* is an optimal 
solution of problem (GJ>, where x* is defined as follows: 

x* ifv*>0; . = 1 
otherwise, 
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4 Penalty decomposition methods for general Iq minimization 



In this section we propose penalty decomposition (PD) methods for solving general Iq minimization 
problems (JTJ) and and establish their convergence. Throughout this section, we make the following 
assumption for problems §1$) and ([2]). 

Assumption 1 Problems $7$ and p|) are feasible, and moreover, at least a feasible solution, denoted 
by x feaf % is known. 

This assumption will be used to design the PD methods with nice convergence properties. It can be 
dropped, but the theoretical convergence of the corresponding PD methods may become weaker. We 
shall also mention that, for numerous real applications, x leas is readily available or can be observed from 
the physical background of problems. For example, all application problems discussed in Section [5] have 
a trivial feasible solution. On the other hand, for some problems which do not have a trivial feasible 
solution, one can always approximate them by the problems which have a trivial feasible solution. For 
instance, problem ([1]) can be approximately solved as the following problem: 

min{/(x) + /9(||n + || 2 + ||i>|| 2 ) : g{x) — u < 0, h(x) — v = 0, \\x j\\q < r} 
for some large p. The latter problem has a trivial feasible solution when X is sufficiently simple. 
4.1 Penalty decomposition method for problem (OQ) 

In this subsection we propose a PD method for solving problem ([I]) and establish its convergence. 
We observe that (P) can be equivalently reformulated as 

min {/(a) : g(x) < 0, h(x) = 0, xj - y = 0}, (10) 

where 

y = {ye^ Jl : \\y\\o<r}. 
The associated quadratic penalty function is defined as follows: 

qe ( x ,y) = f( x ) + ^\\[g( x )}+f + \\h(x)\\ 2 + \\xj-y\\ 2 ) VxeX,yey (11) 

for some penalty parameter g > 0. 

We are now ready to propose a PD method for solving problem (|10p (or equivalently, ([1])) in which 
each penalty subproblem is approximately solved by a block coordinate descent (BCD) method. 

Penalty decomposition method for (fTl) : 

Let {efc} be a positive decreasing sequence. Let qq > 0, a > 1 be given. Choose an arbitrary y® € y 
and a constant T > max{/(x feas ), min^g^ q eo (x, y®)}- Set k = 0. 

1) Set I = and apply the BCD method to find an approximate solution (x k ,y k ) € X x y for the 
penalty subproblem 

mm{q 8h (x,y) : xeX, y G y} (12) 

by performing steps la)-ld): 
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la) Solve xf +l G Aigmmq ek (x,y k ). 
lb) Solve yf +1 G Argmmq g (xf +v y). 

y&y 

lc) Set (x k ,y k ) := (xf +1 ,yf +1 ). If (x k ,y k ) satisfies 

\\V x {x k -V x q ek (x k ,y k ))-x k \\ < e k , (13) 

then go to step 2). 
Id) Set I <— I + 1 and go to step la). 

2) Set ftt+i := o-£ fc . 

3) If mmq gk+1 (x,y k ) > T, set y k+1 := x fcas . Otherwise, set y k+1 := y fe . 

4) Set k <— k + 1 and go to step 1). 
end 

Remark. The condition (| 13j) will be used to establish the global convergence of the above method. 
It may not be easily verifiable unless X is simple. On the other hand, we observe that the sequence 
{qg k (xf ,y k )} is non-increasing for any fixed k. In practice, it is thus reasonable to terminate the BCD 
method based on the progress of {q gk (x k ,y k )}. Another practical termination criterion for the BCD 
method is based on the relative change of the sequence {(xf,yf)}, that is, 

{I I rrtk ™fc I HlA^ II 1 

W^l x Z-llloo Will Ul-lWoo I < e/ 
max(||xf Hoc, 1) ' max(||yf ||oo, 1) J _ 

for some ej > 0. In addition, we can terminate the outer iterations of the PD method once 

||x fc -/||oo < e Q (15) 



for some to > 0. Given that problem (|12p is nonconvex, the BCD method may converge to a stationary 
point. To enhance the performance of the BCD method, one may execute it multiple times by restarting 
from a suitable perturbation of the current best approximate solution. For example, at the kth outer 
iteration, let (x k ,y k ) be the current best approximate solution of (fl~2|) found by the BCD method, and let 
r k = ||y fc ||o- Assume that > 1. Before starting the (/c+l)th outer iteration, one can re-apply the BCD 
method starting from y$ G Argmin{||y — y k \\ : \\y\\o < — 1} and obtain a new approximate solution 
(x k ,y k ) of (fl~2j) . If q gh (x k ,y k ) is "sufficiently" smaller than q Qk (x k ,y k ), one can set (x k ,y k ) := (x k ,y k ) 
and repeat the above process. Otherwise, one can terminate the kth outer iteration and start the 
next outer iteration. Finally, it follows from Proposition 13.11 that the subproblem in step lb) has a 
closed-form solution. ■ 

We next establish a convergence result regarding the inner iterations of the above PD method. In 
particular, we will show that an approximate solution (x k ,y k ) of problem ()12p satisfying (|13l) can be 
found by the BCD method described in steps la)-ld). For notational convenience, we omit the index k 
from (I12j) and consider the BCD method for solving the problem 

min{<^(x,y) : x G X, y G y} (16) 
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instead. Accordingly, we rename the iterates of the above BCD method and present it as follows. 

Block coordinate descent method for (I16j) : 

Choose an arbitrary initial point y° £ y. Set / = 0. 

1) Solve x l+l £ Argminq , £ ,(x, y l ). 

x£X 

2) Solve y l+1 £ Kigminq e {x lJrl ,y) . 

3) Set I <— I + 1 and go to step 1). 
end 

Lemma 4.1 Suppose that (x*,y*) G 5R n x Sft'"' is a saddle point of problem (|16p . that is, 

x* G Argmmq g (x,y*), y* G Argming e (x*, y). (17) 

Furthermore, assume that h's are affine functions, and f and g's are convex functions. Then, (x*,y*) 
is a local minimizer of problem (|16|) . 

Proof. Let K = {i : y* ^ 0}, and let h x , h y be any two vectors such that x* + h x £ X, y* + h y £ y 
and < \y*\ for all i £ K. Claim that 

(y*-x*j) T h y = 0. (18) 

If ||x}||o > r, we observe from the second relation of (fTTj) and Proposition 13.11 that ||y*||o = r and 
y* = x*j(ij for all i £ K, which, together with y* + h y £ y and < \y*\ for all i £ K, implies that 

(hy)i = for all i ^ K and hence (|18p holds. On the other hand, if ||x}||o < r, one can observe that 
y* = x*j and thus (|18p also holds. In addition, by the assumption that h's are affine functions, and / 
and ^'s are convex functions, we know that q e is convex. It then follows from the first relation of (|17p 
and the first-order optimality condition that [V x Og(x* ,y*)] T h x > 0. Using this relation along with (|18|) 
and the convexity of q g , we have 

q e (x* + h x ,y* + h y ) > q e (x* ,y*) + [V x q g (x* ,y*)] T h x + [V y q g (x* ,y*)] T h y 

= q e {x*,y*) + [V x q Q (x*,y*)] T h x + g{y* -x*j) T h y > q Q (x*,y*), 

which together with the above choice of h x and h y implies that (x*,y*) is a local minimizer of (I16h . ■ 

Theorem 4.2 iyei {(x z , y')} 6e the sequence generated by the above BCD method, and let e > be given. 
Suppose that (x*,y*) is an accumulation point of{(x l ,y 1 )}. Then the following statements hold: 

(a) (x*,y*) is a saddle point of problem f)16|) . 

(b) There exists some I > such that 

\\Vx(x l - V x q e (x l ,y l ))-x l \\ < e. 
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(c) Furthermore, if h's are affine functions, and f and g's are convex functions, then (x*,y*) is a 
local minimizer of problem (|16p . 



Proof. We first show that statement (a) holds. Indeed, one can observe that 

q s {x l+ \y l ) < q e (x,y l ) Vx G X, (19) 

q e (x l J) < q e (x l ,y) \/y G y. (20) 

It follows that 

q e (x l +\y l+1 ) < q Q {x l+ \y l ) < q e {x\y l ) W > 1. (21) 

Hence, the sequence {q g (x , y )} is non-increasing. Since (x*,y*) is an accumulation point of {(x l ,y 1 )}, 
there exists a subsequence L such that limi € L^ 00 (x l , y) = (x*,y*). We then observe that {q s {x l , y l )}ieL 
is bounded, which together with the monotonicity of {q g (x l , y 1 )} implies that {q g (x l , y 1 )} is bounded 
below and hence linn^oo q Q (x l , y l ) exists. This observation, (j2Tj) and the continuity of q g (-,-) yield 

lim q g (x l+1 ,y l ) = lim q e (x l ,y l ) = lim q g {x l ,y l ) = q Q (x*,y*). 

I— >oo l— >oo (GL— >oo 

Using these relations, the continuity of q g (-,-), and taking limits on both sides of (|19p and (|20j) as 
I £ L — > oo, we have 

q e (x*,y*) < q e (x,y*) VxGX, (22) 

q e {x\y*) < q e (x*,y) Vy G y. (23) 

In addition, from the definition of y, we know that \\y l \\o < which immediately implies ||2/*||o < r - 
Also, x* £ X due to the closedness of X. This together with (|22l) and (|23"j) implies that (x*,y*) is a 
saddle point of (|16p and hence statement (a) holds. Using (|22p and the first-order optimality condition, 
we have 

\\V x (x* -V x q g (x*,y*))-x*\\=0. 
By the continuity of Vx(-) and V x q g (-, •), and the relation lim; g i_ >00 (x', y l ) = (x*, y*), one can see that 

lim \\V x (x l - V x q e {x l ,y l ))-x l \\ =0, 

and hence, statement (b) immediately follows. In addition, statement (c) holds due to statement (a) 
and Lemma l4.1i ■ 

The following theorem establishes the convergence of the outer iterations of the PD method for 
solving problem ([1]). In particular, we show that under some suitable assumption, any accumulation 
point of the sequence generated by the PD method satisfies the first-order optimality conditions of ([T]). 
Moreover, when the Iq part is the only nonconvex part, we show that under some assumption, the 
accumulation point is a local minimizer of ([T]). 

Theorem 4.3 Assume that — > 0. Let {(x k ,y k )} be the sequence generated by the above PD method, 
Ik = • • • ,i k } be a set of r distinct indices in {1, ... , \ J\} such that (y k )i = for any i ^ Ik, and 
let Jk = {J(i) '■ i G Ik}- Suppose that the level set Xy := {x G X : f(x) < T} is compact. Then, the 
following statements hold: 

(a) The sequence {(x k ,y k )} is bounded. 
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(b) Suppose (x*,y*) is an accumulation point of{(x k ,y k )}. Then, x* = y* and x* is a feasible point of 
problem |7]). Moreover, there exists a subsequence K such that {(x k ,y k )}keK — > ( x *,y*), Ik = I* 
and Jfc = J* := {J(i) : i G I*} for some index set I* C {1, . . . , |J|} when k G K is sufficiently 
large. 

(c) Let x* , K and J* be defined above, and let J* = J\J*. Suppose that the Robinson condition ([3]) 
holds at x* for such J*. Then, {(X k , ^j, k ,uj k )}keK is bounded, where 

\ k = Qk[g(x k )} + , fi k = g k h(x k ), w k = 6k (x k -y k ). (24) 

Moreover, each accumulation point (A*, /x*, w*) of {(X k , fi k ,w k )}keK together with x* satisfies the 
first-order optimality conditions ([5]) with z*- = w* for all j = J(i) G J*. Further, if ||xj||o = r, h's 
are affine functions, and f and g 's are convex functions, then x* is a local minimizer of problem 

Proof. In view of (|lip and our choice of j/g that is specified in step 3), one can observe that 

f(x h ) + f (||b(^)] + H 2 + |IM^)H 2 + 114 " /II 2 ) = Q ek (x k ,y k ) < mmq gk (x,y k ) < T Vfc. (25) 

It immediately implies that {x k } C Xy, and hence, {x k } is bounded. Moreover, we can obtain from 
(1251) that 

\\x k - y k \\ 2 < 2[T - f(x k ))/ Qk < 2[T - min f(x)}/g , 

which together with the boundedness of {x k } yields that {y k } is bounded. Therefore, statement (a) 
follows. We next show that statement (b) also holds. Since (x*,y*) is an accumulation point of 
{(x k ,y k )}, there exists a subsequence {(x k ,y k )} k€ K ~^ { x *iV*)- Recall that I k is an index set. It 
follows that {(i k , ■ ■ ■ , i k )} keR ^ s bounded for all k. Thus there exists a subsequence K C K such 
that {(i k , . . . , i k )}k£K — > • • • j i*) for some r distinct indices i\, . . . ,i*. Since i k , . . . , i k are r dis- 
tinct integers, one can easily conclude that (i k , . . . ,i k ) = for sufficiently large k G K. Let 
I* = {i*, . . . ,i*}. It then follows that I k = I* and J k = J* when k G K is sufficiently large, and 
moreover, {(x k , y k )}keK ~ > { X *,U*)- Therefore, statement (b) holds. Finally, we show that statement 
(c) holds. Indeed, let s k be the vector such that 

V X (x k -V x q Qk (x k ,y k ))=x k + s k . 

It then follows from (|13p that \\s k \\ < e k for all k, which together with lim^oo e k = implies lim^oo s k = 
0. By a well-known property of the projection map Vx, we have 

(x-x k -s k f[x k - S7 x q gk (x k ,y k )-x k -s k } < 0, Vx G X. 

Hence, we obtain that 

- V x q e „(x k , y k ) -s k e Afx(x h + s k ). (26) 
Using this relation, (|26p . (|24p and the definition of q e , we have 

- Vf(x k ) - Vg(x k )X k - Vh(x k )fi k - ljw k - s k G Af x (x k + s k ). (27) 

We now claim that {(X k , fi k , w k )} k &K is bounded. Suppose for contradiction that it is unbounded. By 
passing to a subsequence if necessary, we can assume that {\\(X k , fj, k ,w k )\\}keK — > oo. Let (X k , Jl k , w k ) = 



10 



(X k , fj, k ,w k )/\\(X k , n k ,w k )\\. Without loss of generality, we assume that {(X k , ft k , w k )}k£K — > (X,ft, w) 
(otherwise, one can consider its convergent subsequence). Clearly, ||(A, ft, w)\\ = 1. Dividing both sides 
of (|2"T|) by || (A ,/j, k ,m )||, taking limits as k G K — > oo, and using the relation liuik^K^-oo s k = and 
the semicontinuity of Nx(-)i we obtain that 

- Vg{x*)X - Vh(x*)ft - Ijw G Nx{x*). (28) 

We can see from (jU and (|2~3| that A G 3i+, and Aj = for % £ A(x*). Also, from Proposition 13.11 and the 
definitions of y^, Ik and Jk, one can observe that x k k = y k and hence w\ = 0. In addition, we know 
from statement (b) that Ik = I* when k G K is sufficiently large. Hence, wj* = 0. Since Robinson's 
condition ([3]) is satisfied at x* , there exist d G 7x( x *) an d v G 3ft" 1 such that Vi < for i G >4(x*), and 

g'(x*)d — v = — A, h'(x*)d = —ft, (I j«) T d = —m j* , 

where J* is the complement of /* in {1, . . . , | J\}. Recall that A G K+, Aj = for i ^ ^4(x*), and < 
for i G A(x*). Hence, v T X < 0. In addition, since wj* = 0, one has Ijw = Ij*wj*. Using these 
relations, (|28|) . and the facts that d G l~x{x*) and otj* = 0, we have 

||A|| 2 + ||A|| 2 + |M| 2 = -{{-XfX+i-Jifft + i-wj^wj,} 

= -[(g'(x*)d - v) T X + (h'(x*)d) T ft + ((Ij,) T d) T wj,] 
= d T (-Vg(x*)X- Vh{x*)ft- Ijm) +v T X < 0. 

It yields (X,ft,w) = (0,0,0), which contradicts the identity ||(A,/i,ro)|| = 1. Therefore, the subsequence 
{(X k , fi k , zu k )}k£K is bounded. Let {X* , jj* ,w*) be an accumulation point of {(A fc , /i fc , w k )}k^K- By 
passing to a subsequence if necessary, we can assume that (X k ,fi k ,w k ) — > (X*,fi*,w*) as k G K — > oo. 
Taking limits on both sides of (|27h as k G K — > oo, and using the relations lim/ cg ^'_ s . 0O s k = and 
the semicontinuity of J\fx(-), we see that the first relation of © holds with z* = Ijw*. By a similar 
argument as above, one can show that wf* = 0. This together with the definitions of J* and J* implies 
that z* satisfies 

* = f if j G JU J*, 
Zj \™t if j = J(i)€ J*, 

where J is the complement of J in {1, . . . , n}. In addition, we see from plD that X k > and X k gi(x k ) = 
for all i, which immediately lead to the second relation of (|5|). Hence, {X* , ft* ,w*) together with x* 
satisfies ©. Suppose now that ||x}|| = r. Then, J* = {J* C J : |J*| = r,x* = 0,Vj ^ J*} = {J*}. 
Therefore, the assumptions of Theorem 12.31 hold. It then follows from Theorem 12.31 that x* is a local 
minimizer of (TjQ). ■ 

4.2 Penalty decomposition method for problem ([2]) 

In this subsection we propose a PD method for solving problem (|2|) and establish some convergence 
results for it. 

We observe that problem ([2]) can be equivalently reformulated as 

min {f(x) + v\\y\\ : g(x) < 0, h(x) = 0, xj-y = 0}. (29) 
xex,ye$t\ J \ 

The associated quadratic penalty function for (|29|) is defined as 

Pe {x,y) := f(x) + v\\y\\ + |(|| [g(x)} + \\ 2 + ||^(^)|| 2 + \\xj - y\\ 2 ) VxeX,ye ^ (30) 
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for some penalty parameter g > 0. 

We are now ready to present the PD method for solving (|29j) (or, equivalently, ([2])) in which each 
penalty subproblem is approximately solved by a BCD method. 

Penalty decomposition method for 

Let {efc} be a positive decreasing sequence. Let go > 0, a > 1 be given, and let q e be defined in (fTTj) . 
Choose an arbitrary y® G 3?'^' and a constant T such that T > max{/(x feas )+z/||x feas ||o, mm xe x p go (x, Uq)}. 
Set k = 0. 

1) Set / = and apply the BCD method to find an approximate solution (x k ,y k ) G X x 9?^ for the 
penalty subproblem 

mm{p gk (x,y) : x G X, y G 5ft |J| } (31) 

by performing steps la)-ld): 
la) Solve x k +1 G Argminp efc (x, yf). 
lb) Solve G Arg min p fife (a;* +1 , y). 
lc) Set (x k ,y k ) := (x k +l ,y k +l ). If (x k ,y k ) satisfies 

||^(x fe - V x q gk (x k ,y k )) - x k \\ < e k , (32) 

then go to step 2). 
Id) Set I <— I + 1 and go to step la). 

2) Set g k+ i := ag k . 

3) If mmp gk+1 (x,y k ) > T, set y^ +l := x ic&s . Otherwise, set y$ +l := y k . 

4) Set k A; + 1 and go to step 1). 
end 

Remark. The practical termination criteria proposed in Subsection 14.11 can also be applied to this 
PD method. In addition, one can apply a similar strategy as mentioned in Subsection 14.11 to enhance 
the performance of the BCD method for solving (|31|) . Finally, in view of Proposition 13.21 the BCD 
subproblem in step lb) has a closed-form solution. ■ 

We next establish a convergence result regarding the inner iterations of the above PD method. In 
particular, we will show that an approximate solution (x k ,y k ) of problem (|3ip satisfying f|32|) can be 
found by the BCD method described in steps la)-ld). For convenience of presentation, we omit the 
index k from (|3ip and consider the BCD method for solving the following problem: 

va.m{p e (x,y) : x G X, y G K |J| } (33) 

instead. Accordingly, we rename the iterates of the above BCD method. We can observe that the 
resulting BCD method is the same as the one presented in Subsection 14.11 except that p Q and 3?'^' 
replace q Q and y, respectively. For the sake of brevity, we omit the presentation of this BCD method. 
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Lemma 4.4 Suppose that (x*,y*) £ ffi 1 x 3?^' is a saddle point of problem (|33p . that is, 

x* £ Argminp„(x, y*), y* £ Arg min p g (x*,y). (34) 

Furthermore, assume that h's are affine functions, and f and g's are convex functions. Then, (x*,y*) 
is a local minimizer of problem (|33l) . 

Proof. Let K = {i : y* ^ 0}, and let h x , h y be any two vectors such that x* + h x £ X, \(h y )i\ < 
u/{p\x*j^\ + 1) for any i ^ K and < \y*\ for all i £ K. We observe from the second relation 

of (|34p and Proposition 13.21 that y* = for all i £ K. Also, for the above choice of h y , one has 
y* + (/ijji 7^ for all i £ K. Hence, \\y* + (h y )i\\o = \\y*\\o for every i £ K. Using these relations and 
the definition of h y , we can see that 

P(V* -x*j) T h y + u\\y* +h y \\ -u\\y*\\ = x*j(i){h y )i + [|(M«[lo ^ °- ( 35 ) 

In addition, by the assumption that h's are affine functions, and / and <?'s are convex functions, we 
know that q g is convex, where q g is defined in ([TT]) . Also, notice that p g (x, y) = q g (x, y) + v\\y\\^. It then 
follows from the first relation of (|34p and the first-order optimality condition that [V x q g (x*, y*)] T h x > 0. 
Using this relation along with (|35p and the convexity of q e , we have 

p g (x* + h x ,y* + h y ) = q e (x* + h x ,y* + h y ) + v\\y* + h y \\ 

> q e (x*,y*) + [V x q g (x*,y*)] T h x + [V y q e (x* , y*)] T h y + v\\y* + h y \\ 

> Pg{x*,y*) + g{y* - x*j) T h y + u\\y* + h y \\ - u\\y*\\ >p g (x*,y*), 

which together with the above choice of h x and h y implies that (x*,y*) is a local minimizer of ()33|) . ■ 

Theorem 4.5 Ze£ {(a/, ?/)} 6e the sequence generated by the above BCD method, and let e > be given. 
Suppose that (x*,y*) is an accumulation point of {(x l , y 1 )} . Then the following statements hold: 

(a) (x*,y*) is a saddle point of problem ()33p , 

(b) There exists some I > such that 

\\V X {x l -V x q e (x l J)) - x l \\ <e, 
where the function q e is defined in (|lip . 

(c) Furthermore, if h's are affine functions, and f and g's are convex functions, then (x*,y*) is a 
local minimizer of problem (I33p . 

Proof. We first show that statement (a) holds. Indeed, one can observe that 

P g (x l+ \y l ) < Pg (x,y l ) Mx £ X , (36) 

P g {x l J) < Pg (x l ,y) Vy€»' J L (37) 

It follows that 

p g {x l+ \y l+1 ) < Pg {x l+ \y l ) < p g {x l J) VZ>1. (38) 
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Hence, the sequence {p Q {x l ,y 1 )} is non-increasing. Since (x*,y*) is an accumulation point of {(x l ,y 1 )}, 
there exists a subsequence L such that lim,i^L- i . 00 (x l , y ) = (x*,y*), and moreover, x* G due to 
the closedness of X. We then observe from (|3"0"j) that {p e (x z , y l )}i^L is bounded, which together with 
the monotonicity of implies that {p g (x l ,y 1 )} is bounded below and hence lim;_ i>00 p £ ,(x i , y') 

exists. This observation and (|38|) yield 



hmp e (x l ,y l ) = \imp e (x l+ \y l ). (39) 

f— too I— too 



For notational convenience, let 



It then follows from (1301) that 



> F(x*) + v\\y*\\ + ^\\x*j-y*\\ 2 = P e (x*,V*), Vj, € »' J I. (41) 



F(s) :=/(x) + |(||[ 5 ( 2 ;)] + || 2 + ||/ l (x)|| 2 ). 



p e (x,y) = F(x) + ^||y||o + |||xj-y|| 2 , Vx G y G »I J '. (40) 

Since lirm e £y = V* \ one has \\y l \\o > l|y*||o for sufficiently large / G L. Using this relation, (j37|) and 
(|40p . we obtain that, when / G L is sufficiently large, 

PffCs 1 ,!/) > P e {x\y l ) = F(x l ) + u\\y l \\ + ^\\x l j-y l \\ 2 > F(x l ) + u\\y*\\ + |||x'j - y l \\ 2 . 

Upon taking limits on both sides of the above inequality as I G L — > oo and using the continuity of F, 
one has 

„.*I|2 

In addition, it follows from (|36p and (|40p that 

F(x) + i||xj - y'|| 2 = p e (x,y l ) - v\\y% > p e (x l+1 ,y l ) - v\\y l \\o 

= F(x l+l ) + \\\x l + l -y l f, VxG*. ^ ' 

Since {||y'||o}zeL is bounded, there exists a subsequence L C L such that lim^^^^ ||y ||o exists. Then 
we have 

lim F(x' +1 ) + lllx'/ 1 -y l \\ 2 = lim p e (x /+1 , y z ) - v\\y l \\ 
leL— too leL— too 

= lim p Q (x l+1 , y l ) — v lim ||y*||o = lj m Pg{x\y l ) — v lj m H^/llo 

IdL— too l£L— too l£L—too l£L— too 

= lim p e {x l ,y l ) - v\\y l \\ = lim F(x z ) + \\\x l j - y l \\ 2 = F(x*) + \\\x*j - y*\\ 2 , 

l&L— too l£L— too 

where the third equality is due to (|39|) . Using this relation and taking limits on both sides of (|42p as 
/ G L — > oo, we further have 

F(x) + ^\\xj-y*f > F{x*)+ l -\\x*j-y*\\ 2 , Vx G X, 



which together with (|30p yields 

Pg{x,y*) > p g (x*,y*), \/xeX. 
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This relation along with (|4ip implies that (x*,y*) is a saddle point of f|33[) and hence statement (a) 
holds. Statement (b) can be similarly proved as that of Theorem 14, 21 In addition, statement (c) holds 
due to statement (a) and Lemma 14.41 ■ 



Remark. A similar result as in statement (c) is recently established in |58| for the BCD method 
when applied to solve the unconstrained problem: 

min-\\Ax - b\\ 2 + %\\Wx - y\\ 2 + V Vi\\yi\\o (43) 
x,y 2 2 

i 

under the assumption that A T A y 0, W T W = I, g > 0, and > for all i. The proof of [581 strongly 
relies on this assumption and the fact that the BCD subproblems have closed-form solutions. We believe 
that it cannot be extended to problem f|33f) . In addition, it is not hard to observe that problem (|43|) 
can be equivalently reformulated into a problem in the form of f|33|) and thus the convergence of the 
BCD method for (|43p directly follows from Theorem 14.51 above. ■ 

We next establish the convergence of the outer iterations of the PD method for solving problem ([2]). 
In particular, we show that under some suitable assumption, any accumulation point of the sequence 
generated by the PD method satisfies the first-order optimality conditions of ([2]). Moreover, when the 
Iq part is the only nonconvex part, we show that the accumulation point is a local minimizer of ([2]). 

Theorem 4.6 Assume that e& — > 0. Let {(x k ,y k )} be the sequence generated by the above PD method. 
Suppose that the level set Xt := {x G X : f{x) < T} is compact. Then, the following statements hold: 

(a) The sequence {(x k ,y k )} is bounded; 

(b) Suppose (x*,y*) is an accumulation point of {(x k ,y k )}. Then, x* = y* and x* is a feasible point 
of problem 

(c) Let (x*,y*) be defined above. Suppose that {(x k ,y k )}k^K (x*,y*) for some subsequence K. Let 
J* = {j € J : x*j 0}, J* = J \ J* . Assume that the Robinson condition (J7|) holds at x* for such 
J*. Then, {(\ k , fi k ,w k )}keK is bounded, where 

A fc = Qk[g{x k )} + , H k = Q k h(x k ), w k = g k (x k j - y k ). 

Moreover, each accumulation point (X*,fi*,w*) of {(X k , n k , ^> k )}keK together with x* satisfies the 
first-order optimality condition ([5]) with z* = w* for all j = J(i) G J*. Further, if h's are affine 
functions, and f and g's are convex functions, then x* is a local minimizer of problem 

Proof. Statement (a) and (b) can be similarly proved as those of Theorem 14.31 We now show that 
statement (c) holds. Let L* = {i : J(i) G J*}. From Proposition 13.21 and the definitions of y k and J*, we 
can observe that y k , = x k , when k G K is sufficiently large. Hence, VD k * = for sufficiently large k G K. 
The rest of the proof for the first two conclusions of this statement is similar to that of statement (c) 
of Theorem 14.31 The last conclusion of this statement holds due to its second conclusion and Theorem 
12^1 ■ 
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5 Numerical results 



In this section, we conduct numerical experiments to test the performance of our PD methods proposed 
in Section H] by applying them to sparse logistic regression, sparse inverse covariance selection, and 
compressed sensing problems. The codes of all the methods implemented in this section are written in 
Matlab, which are available online at www.math.sfu.ca/~zhaosong. All experiments are performed in 
Matlab 7.11.0 (2010b) on a workstation with an Intel Xeon E5410 CPU (2.33 GHz) and 8GB RAM 
running Red Hat Enterprise Linux (kernel 2.6.18). 



5.1 Sparse logistic regression problem 

In this subsection, we apply the PD method studied in Subsection 14.11 to solve sparse logistic regres- 
sion problem, which has numerous applications in machine learning, computer vision, data mining, 
bioinformatics, and neural signal processing (see, for example, [3| 154" ! l32l l43| l22 j 144] ) . 

Given n samples {z 1 , . . . , z n } with p features, and n binary outcomes b±,...,b n , let a % = biz' 1 for 
i = 1, . . . , n. The average logistic loss function is defined as 

n 

Lvg(v,w) := 6(w T a % + vbjj/n 
i=i 

for some model variables tiGJf and w £ W, where 6 is the logistic loss function 

9{t) :=log(l + exp(-t)). 
Then the sparse logistic regression problem can be formulated as 

min{Z avg (u,u>) : \\w\\q < r} , (44) 

v,w 

where r G [l,p] is some integer for controlling the sparsity of the solution. In the literature, one common 
approach for finding an approximate solution to (jUJ) is by solving the following l\ regularization problem: 



minZavg^, w) + A||«;||i (45) 



v,w 



for some regularization parameter A > (see, for example, [28l HH1 [42], [30j EH [49] ) . Our aim below is 
to apply the PD method studied in Subsection 14.11 to solve (|4"4"|) directly. 

Letting x = (v,w), J = {2, . . . ,p + 1} and f(x) = la,vg(xi,xj), we can see that problem ([4"1"|) is in 
the form of ([T|). Therefore, the PD method proposed in Subsection 14. II can be suitably applied to solve 
(|44p . Also, we observe that the main computation effort of the PD method when applied to (j44p lies in 
solving the subproblem arising in step la), which is in the form of 

min{/ avg (xi,xj) + |||x-c|| 2 : x G 3? p+1 | (46) 

for some q > and c G To efficiently solve (|46p . we apply the nonmonotone projected gradient 

method proposed in [4, Algorithm 2.2]; in particular, we set its parameter M = 2 and terminate the 
method when ||Vi ? (x)||/max{|i ? (x)|, 1} < 10~ 4 , where F(x) denotes the objective function of (I46p . 

We now address the initialization and the termination criteria for our PD method when applied to 
(|44|) . In particular, we randomly generate z G such that ||^j||o < r and set the initial point y|] = z. 
We choose the initial penalty parameter £o to be 0.1, and set the parameter a = \/l0. In addition, we 
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use (|14p and (|15p as the inner and outer termination criteria for the PD method and set their accuracy 
parameters ej and eo to be 5 x 10~ 4 and 10 -3 , respectively. 

We next conduct numerical experiments to test the performance of our PD method for solving 
(|44p on some real and random data. We also compare the quality of the approximate solutions of (|44p 
obtained by our method with that of (|45|) found by a first-order solver SLEP |33j. For the latter method, 
we set opts.mFlag=l, opts.lFlag=l and opts.tFlag=2. And the rest of its parameters are set by default. 

In the first experiment, we compare the solution quality of our PD method with SLEP on three 
small- or medium-sized benchmark data sets which are from the UCI machine learning bench market 
repository [40] and other sources [23]. The first data set is the colon tumor gene expression data |23] 
with more features than samples; the second one is the ionosphere data [ID] with less features than 
samples; and the third one is the Internet advertisements data [40] with roughly same magnitude of 
features as samples. We discard the samples with missing data and standardize each data set so that 
the sample mean is zero and the sample variance is one. For each data set, we first apply SLEP to solve 
problem ()45p with four different values of A, which are the same ones as used in [28], namely, 0.5A max , 
0.1A max , 0.05A 

max) and 0.01A max , where A max is the upper bound on the useful range of A that is defined 
in |28j . For each such A, let w x be the approximate optimal w obtained by SLEP. We then apply our 
PD method to solve problem ()44p with r = \\wt\\o so that the resulting approximate optimal w is at 
least as sparse as w* x . 

To compare the solution quality of the above two methods, we introduce a criterion, that is, error 
rate. Given any model variables (v, w) and a sample vector z 6 W, the outcome predicted by (v, w) for 
z is given by 

<j)(z) = sgn(w T z + v), 

where 

' +1 if t > 0, 



sgn(t) 



T otherwise. 



Recall that z % and hi are the given samples and outcomes for i = 1, . . . ,n. The error rate of (v,w) for 
predicting the outcomes b\ , . . . , b n is defined as 



Error := i ^ - bt\\ /n \ x 100%. 



I i=l ) 

The computational results are presented in Table HJ In detail, the name and dimensions of each 
data set are given in the first three columns. The fourth column gives the ratio between A and its upper 
bound A max . The fifth column lists the value of r, that is, the cardinality of w* x which is defined above. 
In addition, the average logistic loss, the error rate and the CPU time (in seconds) for both SLEP and 
PD are reported in columns six to eleven. We can observe that, although SLEP is faster than the PD 
method in most cases, the PD method substantially outperforms SLEP in terms of the solution quality 
since it generally achieves lower average logistic loss and error rate while the sparsity of both solutions 
is the same. 

In the second experiment, we test our PD method on the random data sets of three different sizes. 
For each size, we randomly generate the data set consisting of 100 instances. In particular, the first 
data set has more features than samples; the second data set has more samples than features; and the 
last data set has equal number of features as samples. The samples {z 1 , . . . , z n } and the corresponding 
outcomes b\,...,b n are generated in the same manner as described in [28] , In detail, for each instance 
we choose equal number of positive and negative samples, that is, m + = m_ = m/2, where m + (resp., 
m_) is the number of samples with outcome +1 (resp., —1). The features of positive (resp., negative) 
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Table 1: Computational results on three real data sets 
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Table 2: Computational results on random data sets 



Size 








SLEP 






PD 




n x p 


A/A ma x 


r 


^avg 


Error (%) 


Time 


^avg 


Error (%) 


Time 


1000 x 2000 


0.9 


17.0 


0.6411 


9.76 


0.4 


0.2145 


8.49 


9.9 




0.7 


52.9 


0.5090 


3.96 


1.0 


0.0588 


2.66 


20.0 




0.5 


96.6 


0.3838 


2.23 


1.7 


0.0060 


0.02 


34.9 




0.3 


138.7 


0.2611 


1.22 


2.1 


0.0022 





25.5 




0.1 


192.0 


0.1228 


0.31 


2.0 


0.0013 





16.0 


2000 x 1000 


0.9 


11.0 


0.6441 


11.46 


0.4 


0.2763 


10.67 


15.2 




0.7 


42.8 


0.5083 


3.63 


1.1 


0.0376 


1.49 


38.9 




0.5 


78.0 


0.3776 


1.65 


2.0 


0.0032 





34.4 




0.3 


115.5 


0.2490 


0.6 


2.6 


0.0015 





25.3 




0.1 


160.8 


0.1056 


0.03 


3.1 


0.0010 





15.8 


1000 x 1000 


0.9 


11.7 


0.6417 


11.00 


0.1 


0.2444 


9.67 


2.3 




0.7 


37.2 


0.5086 


3.95 


0.2 


0.0572 


2.46 


5.8 




0.5 


67.6 


0.3805 


2.15 


0.3 


0.0060 


0.01 


6.2 




0.3 


100.1 


0.2544 


0.81 


0.4 


0.0016 





4.6 




0.1 


137.9 


0.1124 


0.12 


0.5 


0.0011 





3.3 



samples are independent and identically distributed, drawn from a normal distribution iV(/x, 1), where 
/i is in turn drawn from a uniform distribution on [0, 1] (resp., [—1,0]). For each such instance, similar 
to the previous experiment, we first apply SLEP to solve problem (|45p with five different values of A, 
which are 0.9A max , 0.7A max , 0.5A max , 0.3A max and 0.1A max . For each such A, let w\ be the approximate 
optimal w obtained by SLEP. We then apply our PD method to solve problem (|44p with r = ||u^||o so 
that the resulting approximate optimal w is at least as sparse as w\. The average results of each data 
set over 100 instances are reported in Table [2j We also observe that the PD method is slower than 
SLEP, but it has better solution quality than SLEP in terms of average logistic loss and error rate. 

In summary, the above experiments demonstrate that the quality of the approximate solution of 
(|44p obtained by our PD method is generally better than that of (|45p found by SLEP when the same 
sparsity is considered. This observation is actually not surprising as (|45p is a relaxation of (|44p . 

5.2 Sparse inverse covariance selection problem 

In this subsection, we apply the PD method proposed in Subsection 14.11 to solve the sparse inverse 
covariance selection problem, which has numerous real-world applications such as speech recognition 
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and gene network analysis (see, for example, [2| [T8] ) . 

Given a sample covariance matrix 5] G 5++ and a set consisting of pairs of known conditionally 
independent nodes, the sparse inverse covariance selection problem can be formulated as 

max logdetX- (S,X) 
s.t. Yl \\Xij\\o<r, (47) 

Xij = o v(i,j)en, 

where Cl = ■ £1, i ^ j}, and r G [1, is some integer for controlling the sparsity of the 

solution. In the literature, one common approach for finding an approximate solution to (|47p is by 
solving the following l\ regularization problem: 

max logdetX — (E,X) — Yl Pijl^ijl 

Xt0 (ij)efi (48) 

s.t. Xij = V(i,j)en, 

where {Pf?'}(ij)en is a set of regularization parameters (see, for example, [1^ 1 [T5 | [T} [35 j l36 j |2"T | [56 | [3^] ) . 
Our goal below is to apply the PD method studied in Subsection 14.11 to solve (jl7|) directly. 

Letting X = {X G 5+ : Xy = 0, G fl} and J = £l, we clearly see that problem ()47p is in the 

form of (HJ) and thus it can be suitably solved by the PD method proposed in Subsection 14.11 with 



y 



(i,j)eti 



Notice that the main computation effort of the PD method when applied to (I47p lies in solving the 
subproblem arising in step la), which is in the form of 



mm 



{- logdet X + ^\\X -Cf F : Xij =0V(i,j) G o} (49) 



for some q > and C G S p . Given that problem ()49|) generally does not have a closed-form solution, 
we now slightly modify the above sets X and y by replacing them by 



x = sl, y 



YgS p : ^ ll^-llo < r, = 0, G Vt 

(i,j)en 



respectively, and then apply the PD method presented in Subsection 14.11 to solve (|37|) . For this PD 
method, the subproblem arising in step la) is now in the form of 



{-logdetX + |||X-C||| :Ih0} (50) 
for some g > and C G S p . It can be shown that problem (|50p has a closed-form solution, which is given 



mm 

x 



by V3>(x*)V T , where x* = (Aj + J A? + 4/g)/2 for all i and F^(A)F T is the eigenvalue decomposition 
of C for some A 6 K p (see, for example, Proposition 2.7 of [37] ). Also, it follows from Proposition 13.11 
that the subproblem arising in step lb) for the above y has a closed-form solution. 
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We now address the initialization and the termination criteria for the above PD method. In partic- 
ular, we set the initial point Y Q ° = (i^(S)) _1 , the initial penalty parameter go = 1, and the parameter 
a = vlO. In addition, we use (fT5j) and 

\Qe k ( x i+i^i+i) ~ QQk( x i^i)\ < ei 
max{\q gk (xf,yf)\,l} ' 

as the outer and inner termination criteria for the PD method, and set the associated accuracy param- 
eters eo = 10 -4 and ei = 10 -4 , 10~ 3 for the random and real data below, respectively. 

We next conduct numerical experiments to test the performance of our PD method for solving (|47p 
on some random and real data. We also compare the quality of the approximate solutions of (|47p 
obtained by our method with that of (|48p found by the proximal point algorithm (PPA) |56j . Both 
methods call the LAPACK routine dsyevd.f [29\ for computing the full eigenvalue decomposition of a 
symmetric matrix, which is usually faster than the Matlab's eig routine when p is larger than 500. For 
PPA, we set Tol = 10~ 6 and use the default values for all other parameters. 

In the first experiment, we compare the solution quality of our PD method with PPA on a set 
of random instances which are generated in a similar manner as described in [141 [35], [36], [56] 134] . In 
particular, we first generate a true covariance matrix S* G 5++ such that its inverse (X! t )~ 1 is with the 
prescribed density <5, and set 

Q = {(i,j):(ll% 1 = 0, \i-j\>[p/2\}. 
We then generate a matrix B G S p by letting 

5 = 1]* + tV, 

where V £ S p contains pseudo-random values drawn from a uniform distribution on the interval [—1,1], 
and r is a small positive number. Finally, we obtain the following sample covariance matrix: 

E = £-min{A min (B)-tf,0}I, 

where i? is a small positive number. Specifically, we choose r = 0.15, $ = l.Oe — 4, 5 = 10%, 50% and 
100%, respectively. It is clear that for 5 = 100% and the set Q is an empty set. In addition, for all 
G we set pij = p& for some p& > 0. For each instance, we first apply PPA to solve (|48l) for 
four values of p^, which are 0.01, 0.1, 1, and 10. For each p^, let X* be the solution obtained by PPA. 
We then apply our PD method to solve problem (1471) with r = Y^(ij)en \\X*j\\o so that the resulting 
solution is at least as sparse as X*. 

As mentioned in [34J, to evaluate how well the true inverse covariance matrix (S*)" 1 is recovered 
by a matrix X G one can compute the normalized entropy loss which is defined as follows: 

Loss := -((£*, X) -logdet(E*X) -p). 
p 

The results of PPA and the PD method on these instances are presented in Tables [3]l5] respectively. In 
each table, the order p of S is given in column one. The size of O, is given in column two. The values 
of Pq and r are given in columns three and four. The log-likelihood (i.e., the objective value of (|47p ). 
the normalized entropy loss and the CPU time (in seconds) of PPA and the PD method are given in 
the last six columns, respectively. We observe that our PD method is substantially faster than PPA 
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Table 3: Computational results for S — 10% 



Problem 








PPA 






PD 






ii 


ril 


V 


Likelihood 


Loss 


Time 


Likelihood 


Loss 


Time 


500 


56724 


0.01 


lOOO I \J 


— Q^iO 

UOKJ • OO 


2.4594 


34.1 


—936.45 


9 3990 


2.5 






o in 


4^01 S 


_qqq sq 


9 c i74q 


44.8 


y i o.U-L 


2.4498 


5 3 






i nn 


JU4:U 


— 1 04fi 44 


9 Q1 qn 


UU.jj 


— 1 0^9 7q 


9 fi^sn 


24.8 






10.0 


2608 


— 1471.67 


4.2442 


75.1 


— 1129.50 


2.8845 


55.5 


1000 


226702 


0.01 


745470 


—2247.14 


3.1240 


150.2 


—9990 47 


3.0486 


13.1 






n i n 

U. 1-U 


1 8fifin9 

-lOUUU^ 


— 9^44 n^ 


3.2291 


1 ^8 7 




3.1224 


1 Q 8 






i nn 

l.UU 


29110 


ilUU.OO 


O.lIUOI 


^40. 8 






5Q 1 

Oiy . .1 






10.0 


9604 


-3094.57 


4.6834 


395.9 


-2515.80 


3.4243 


129.5 


1500 


509978 


0.01 


1686128 


-3647.71 


3.4894 


373.7 


-3607.23 


3.4083 


35.7 






0.10 


438146 


-3799.02 


3.5933 


303.6 


-3731.17 


3.5059 


44.9 






1.00 


61222 


-3873.93 


3.8319 


907.4 


— 3832 88 


3 6226 


155.3 






i n n 

1U.U 


17360 


-4780.33 


4.9264 


698.8 


— oyz^. y^t 


3 71 /Ifi 


■398 n 


2000 


905240 


0.01 


3012206 


-5177.80 


3.7803 


780.0 


—51 26 OQ 


3.7046 


65.5 






n i n 

U. 1-U 


822714 


-5375.21 


3.8797 


657.5 


— ^989 37 


^ 7Qm 

o. i yui 


q4 3 






1.00 


126604 


-5457.90 


4.0919 


907.4 


— 5494 66 


3.9713 


200.2 






i n n 


29954 


-6535.54 


5.1130 


1397.4 


— ^^9 0^ 


4 nm q 


^88 n 

iJOO.U 








Table 4: Computational results for 6 - 


- 50% 






Problem 








PPA 






PD 




P 


|Q| 


£>(=> 


r 


Likelihood 


Loss 


Time 


Likelihood 


Loss 


Time 


500 


37738 


0.01 


202226 


-947.33 


3.1774 


37.2 


—935.11 


3.1134 


2.2 






n i n 

U. 1-U 


50118 


-1001.23 


3.3040 


41.8 


y i o.uo 


"\ 1 fi(S9 

O.lUUi 


4 7 






i nn 

l.UU 


11810 


-1052.09 


3.6779 


81.1 


— 101 SO 


"\ 988Q 


1 4 






10.0 


5032 


-1500.00 


5.0486 


71.1 


-1041.64 


3.3966 


28.1 


1000 


152512 


n ni 


816070 


-2225.875 


3.8864 


149.7 


_ 99m Q8 


O.OliU 


12.1 






n i n 

U. 1U 


203686 


-2335.81 


4.0029 


131.0 


— 9988 1 1 
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i nn 

l.UU 


46928 


-2400.81 


4.2945 


372.7 


^o^y .u^ 


4 nns^ 


44.1 






10.0 


17370 


-3128.63 


5.5159 


265.2 


-2390.09 


4.1138 


84.3 


1500 


340656 


0.01 


1851266 


-3649.78 


4.2553 


361.2 


-3616.72 


4.1787 


32.0 






0.10 


475146 


-3815.09 


4.3668 


303.4 


-3743.19 


4.2725 


42.3 






1.00 


42902 


-3895.09 


4.6025 


1341.0 


-3874.68 


4.4823 


155.8 






10.0 


7430 


-4759.67 


5.6739 


881.2 


-4253.34 


4.6876 


468.6 


2000 


605990 


0.01 


3301648 


-5149.12 


4.5763 


801.3 


-5104.27 


4.5006 


61.7 






0.10 


893410 


-5371.26 


4.6851 


620.0 


-5269.06 


4.5969 


82.4 






1.00 


153984 


-5456.54 


4.9033 


1426.0 


-5406.89 


4.7614 


175.9 






10.0 


33456 


-6560.54 


5.9405 


1552.3 


-5512.48 


4.7982 


565.5 



for these instances. Moreover, it outperforms PPA in terms of solution quality since it achieves larger 
log-likelihood and smaller normalized entropy loss. 

Our second experiment is similar to the one conducted in |14| [36] . We intend to compare sparse 
recoverability of our PD method with PPA. To this aim, we specialize p = 30 and (S*) -1 G . to be 
the matrix with diagonal entries around one and a few randomly chosen, nonzero off-diagonal entries 
equal to +1 or —1. And the sample covariance matrix S is then similarly generated as above. In 
addition, we set Q = '■ (S*)7J- =0, \i — j\ > 15} and p-ij = for all £ where Pq is 

the smallest number such that the approximate solution obtained by PPA shares the same number of 
nonzero off-diagonal entries as (S*) -1 . For problem (|4Tjl . we choose r = IK^*)^ [ 1 (i-e., the 

number of nonzero off-diagonal entries of (Xl t ) _1 ). PPA and the PD method are then applied to solve 
(|48p and (|47p with the aforementioned pij and r, respectively. In Figure [TJ we plot the sparsity patterns 
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Table 5: Computational results for 5 — 100% 
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PPA 






PD 






ii 
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Time 
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Loss 


Time 
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4 fi^n4 
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10.0 
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-3127.94 


5.8521 


244.1 


-2388.22 


4.4466 


79.0 


1500 





0.01 


2181060 


-3585.21 


4.5878 


364.1 


-3545.43 


4.5260 


12.3 






0.10 


551150 


-3806.07 


4.7234 


288.2 


-3717.25 


4.6059 


41.3 






1.00 


102512 


-3883.94 


4.9709 


912.8 


-3826.26 


4.7537 


93.5 






10.0 


31526 


-4821.26 


6.0886 


848.7 


-3898.50 


4.8824 


185.4 


2000 





0.01 


3892592 


-5075.44 


4.8867 


734.1 


-5021.95 


4.8222 


23.8 






0.10 


1027584 


-5367.86 


5.0183 


590.6 


-5246.45 


4.9138 


76.1 






1.00 


122394 


-5456.64 


5.2330 


1705.8 


-5422.48 


5.1168 


197.8 






10.0 


25298 


-6531.08 


6.2571 


1803.4 


-5636.74 


5.3492 


417.1 
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(c) Approximate solution of ||48|I (d) Approximate solution of 1471 

Figure 1: Sparse recovery. 





of the original inverse covariance matrix (X! t ) -1 , the noisy inverse sample covariance matrix and 
the approximate solutions to (j48]) and (j47]) obtained by PPA and our PD method, respectively. We first 
observe that the sparsity of both solutions is the same as (Xl*) -1 . Moreover, the solution of our PD 
method completely recovers the sparsity patterns of (S*) -1 , but the solution of PPA misrecovers a few 
patterns. In addition, we present the log-likelihood and the normalized entropy loss of these solutions 
in Table [6l One can see that the solution of our PD method achieves much larger log-likelihood and 
smaller normalized entropy loss. 
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Table 6: Numerical results for sparse recovery 





nnz 


Likelihood 


Loss 


PPA 
PD 


24 
24 


-35.45 
-29.56 


0.178 
0.008 



Table 7: Computational results on two real data sets 



Data 


Genes 


Samples 








PPA 






PD 






P 


n 


Pn 


r 


Likelihood 


Loss 


Time 


Likelihood 


Loss 


Time 


Lymph 


587 


148 


0.01 


144294 


790.12 


23.24 


101.5 


1035.24 


22.79 


38.0 








0.05 


67474 


174.86 


24.35 


85.2 


716.97 


23.27 


31.5 








0.10 


38504 


-47.03 


24.73 


66.7 


389.65 


23.85 


26.1 








0.50 


4440 


-561.38 


25.52 


33.2 


-260.32 


24.91 


24.8 








0.70 


940 


-642.05 


25.63 


26.9 


-511.70 


25.30 


22.0 








0.90 
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-684.59 


25.70 


22.0 


-598.05 


25.51 


14.9 


Leukemia 


1255 


72 


0.01 


249216 


3229.75 


28.25 


705.7 


3555.38 


28.12 


177.1 








0.05 


169144 


1308.38 


29.85 


491.1 


2996.95 


28.45 


189.2 








0.10 


107180 


505.02 


30.53 


501.4 


2531.62 


28.82 


202.8 








0.50 


37914 


-931.59 


31.65 


345.9 


797.23 


30.16 


256.6 








0.70 


4764 


-1367.22 


31.84 


125.7 


-1012.48 


31.48 


271.6 








0.90 


24 


-1465.70 


31.90 


110.6 


-1301.99 


31.68 


187.8 



In the third experiment, we aim to compare the performance of our PD method with the PPA on 
two gene expression data sets that have been widely used in the literature (see, for example, [Ml H3 EH 
El El]). We first pre-process the data by the same procedure as described in [3l] to obtain a sample 
covariance matrix and set Q, = and pij = p^ for some p^ > 0. We apply PPA to solve problem (j4*8j) 
with Pq = 0.01, 0.05, 0.1, 0.5, 0.7 and 0.9, respectively. For each p^, we choose r to be the number of 
nonzero off-diagonal entries of the solution of PPA, which implies that the solution of the PD method 
when applied to (|47j) is at least as sparse as that of PPA. As the true covariance matrix is unknown 
for these data sets, we now modify the normalized entropy loss defined above by replacing by 5]. 
The results of PPA and our PD method on these two data sets are presented in Table [71 In detail, the 
name and dimension of each data set are given in the first three columns. The values of Pq and r are 
listed in the fourth and fifth columns. The log-likelihood, the normalized entropy loss and the CPU 
time (in seconds) of PPA and the PD method are given in the last six columns, respectively. We can 
observe that our PD method is generally faster than PPA. Moreover, our PD method outperforms PPA 
in terms of log-likelihood and normalized entropy loss. 

As a summary, the above experiments show that the quality of the approximate solution of (|47p 
obtained by our PD method is generally better than that of (|48p found by PPA when the same sparsity 
is considered. 

5.3 Compressed sensing 

In this subsection, we apply the PD methods proposed in Section 0] to solve the compressed sensing 
(CS) problem, which has important applications in signal processing (see, for example, [131 1501 l31~l 148} 

iniEHiEa]). 

When the observation is noise free, the CS problem can be formulated as 

min{||x||o : Ax = b}, (51) 

where A E sft nx P is a data matrix and b E K n is an observation vector. One popular approach for finding 
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an approximate solution to (|5ip is to solve the following l\ regularization problem: 



min{||x||i : Ax = b], (52) 

(see, for example, [551 [10]). O ur aim below is to apply the PD method studied in Subsection 14.21 to 
solve problem (15ip directly. 

Clearly, problem (|5ip is in the form of ([2]) and thus the PD method proposed in Subsection 14.21 can 
be suitably applied to solve (|5ip . Also, one can observe that the main computation effort of the PD 
method when applied to (|5ip lies in solving the subproblem arising in step la), which is in the form of 

min{||x — c|| 2 : Ax = b} (53) 

X 

for some c £ K p . It is well known that problem (|53p has a closed- form solution given by 

x* = c-A T (AA T )- 1 (Ac-b). 

We now address the initialization and the termination criteria for the PD method. In particular, we 
choose Uq to be a feasible point of (15ip with at most n nonzero entries which can be obtained by 
executing the Matlab command A \ b. Also, we set the initial penalty parameter go = 0.1 and the 
parameter a = 10. In addition, we use (|14p and 

u k lew 

F - y oo 

< eo 



max{|p efc (x k ,y k )\,l} 



as the inner and outer termination criteria, and set the associated accuracy parameters ej = 10 -5 and 
eo = 10 -6 , respectively. 

We next conduct experiments to test the performance of our PD method for solving problem (|5ip 
on random data. We also compare the quality of the approximate solutions of (I5ip obtained by our PD 
method with that of (|52p found by a first-order solver SPGL1 [55j . For the latter method, we use the 
default values for all parameters. 

In the first experiment, given an integer r 6 [l,p], we randomly generate 100 instances according 
to the standard Gaussian distribution. Each one consists of a sparse signal u with cardinality r and a 
data matrix A £ K nxp . Then we generate the corresponding observation vector b by letting b = Au. 
In particular, we choose n = 1024, p = 4096. The values of r range from 30 to 300 (see Table [8]). 
We now try to recover u by applying the PD method and SPGL1 to solve (|5ip and (I52p . respectively. 
To evaluate the solution quality of these methods, we adopt a similar criterion as described in \4Q\ [7]. 
Given an approximate recovery x* for u, we define the mean squared error as 

MSE:= \\x* -u\\/p. 

We say u is successfully recovered by x* if the cardinality of x* is the same as u and moreover the 
corresponding MSE is less than 10 -4 . The computational results of both methods are presented in 
Table El In detail, the values of r are given in the first column. The number of successfully recovered 
signals (NS) and the CPU time for both methods are reported in columns two to five, respectively. 
We observe that the recoverability of two methods is similar for the instances with relatively small r, 
but the PD method outperforms SPGL1 when r becomes larger. We also see that the speed of both 
methods is comparable. 
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Table 8: Computational results for A with non-orthonormal rows 
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Table 9: Computational results for A with orthonormal rows 





SPGL1 




PD 


Cardinality 


NS 


Time 


NS 


Time 


30 


100 


0.4 


100 


1.8 


60 


100 


0.6 


100 


2.1 


90 


100 


0.7 


100 


2.4 


120 


100 


1.0 


100 


2.8 


150 


92 


1.4 


95 


3.2 


180 


91 


2.1 


95 


4.0 


210 


73 


3.9 


92 


5.5 


240 


29 


9.0 


61 


12.1 
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1 
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11 
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11.3 


1 
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The second experiment is similar to the first one except that A is randomly generated with orthonor- 
mal rows. The computational results of both methods are presented in Table EJ We also observe that 
the PD method outperforms SPGL1 in terms of recoverability. 

In the remainder of this subsection we consider the CS problem with noisy observation. In this case, 
the CS problem can be formulated as 

mm|^Px-6|| 2 : ||x[| <r 

where A £ 3ft nxp is a data matrix, b £ is an observation vector, and r S [l,p] is some integer for 
controlling the sparsity of the solution. One popular approach for finding an approximate solution to 
(|54p is to solve the following l\ regularization problem: 

min -\\Ax - b\\ 2 + A||x||i, (55) 

where A > is a regularization parameter (see, for example, [20j [25l [27] ) . Our goal below is to apply 
the PD method studied in Subsection 14. II to solve ([54")) directly. 

Clearly, problem (154j) is in the form of (pQ) and thus the PD method proposed in Subsection [4] can 
be suitably applied to solve (|54|) . The main computation effort of the PD method when applied to (|54p 
lies in solving the subproblem arising in step la), which is an unconstrained quadratic programming 
problem that can be solved by the conjugate gradient method. We now address the initialization and 
the termination criteria for the PD method. In particular, we randomly choose an initial point G 
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such that 1 1 2/0 1 1 < r - Also, we set the initial penalty parameter qq = 1 and the parameter a = \/l0. In 
addition, we use 

k gfc (^+i,yf+i) -q ek ( x hyi)\ < ej 

max{\q Bk (xf,yf)\,l} 

and 

\\x k - y k \\oo 
max{\q Qk (x k ,y k )\,l} ~ 

as the inner and outer termination criteria for the PD method, and set their associated accuracy 
parameters ej = 10 -2 and eo = 1CP 3 . 

We next conduct numerical experiments to test the performance of our PD method for solving 
problem (f54"|) on random data. We also compare the quality of the approximate solutions of ([54"]) 
obtained by our PD method and the iterative hard-thresholding algorithm (IHT) [5, 6j with that of (|55j) 
found by a first-order solver GPSR [20]. For IHT, we set stopTol = 10~ 6 and use the default values for 
all other parameters. And for GPSR, all the parameters are set as their default values. 

We first randomly generate a data matrix A E $t nxp and an observation vector b E ffl 1 according to 
a standard Gaussian distribution. Then we apply GPSR to problem ([55]) with a set of p distinct A's so 
that the cardinality of the resulting approximate solution gradually increases from 1 to p. Accordingly, 
we apply our PD method and IHT to problem (|54p with r = 1, . . . ,p. It shall be mentioned that a 
warm-start strategy is applied to all three methods. That is, an approximate solution of problem (]54p 
(resp., (|52p ) for current r (resp., A) is used as the initial point for the PD method and IHT (resp., 
GPSR) when applied to the problem for next r (resp., A). The average computational results of both 
methods over 100 random instances with (n,p) = (1024,4096) are plotted in Figure [2] In detail, we 
plot the average residual ||Ac — b\\ against the cardinality in the left graph and the average accumulated 
CPU time0 (in seconds) against the cardinality in the right graph. We observe that the residuals of the 
approximate solutions of (|55p obtained by our PD method and IHT are almost equal and substantially 
smaller than that of (I54p found by GPSR when the same sparsity is considered. In addition, we can see 
that GPSR is faster than the other two methods. 

We also conduct a similar experiment as above except that A is randomly generated with orthonormal 
rows. The results are plotted in Figure [3] We observe that the PD method and IHT are generally slower 
than GPSR, but they have better solution quality than GPSR in terms of residuals. 



6 Concluding remarks 

In this paper we propose penalty decomposition methods for general Iq minimization problems in which 
each subproblem is solved by a block coordinate descend method. Under some suitable assumptions, we 
establish that any accumulation point of the sequence generated by the PD methods satisfies the first- 
order optimality conditions of the problems. Furthermore, for the problems in which the Iq part is the 
only nonconvex part, we show that such an accumulation point is a local minimizer of the problems. The 
computational results on compressed sensing, sparse logistic regression and sparse inverse covariance 
selection problems demonstrate that our methods generally outperform the existing methods in terms 
of solution quality and/or speed. 

We shall remark that the augmented Lagrangian decomposition methods can be developed for 
solving Iq minimization problems (pQ) and (|2]) simply by replacing the quadratic penalty functions in the 

1 For a cardinality r, the corresponding accumulated CPU time is the total CPU time used to compute approximate 
solutions of problem (|54|) or (|52[) with cardinality from 1 to r. 
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A with non-orthonormal rows 
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Figure 2: Trade-off curves. 



.4 with orthonormal rows 



A with orthonormal rows 



-PD 
-IHT 

GPSR 




200 400 600 800 1000 
Cardinality 

(a) Residual vs. Cardinality 




200 400 600 800 1000 
Cardinality 

(b) Time vs. Cardinality 



Figure 3: Trade-off curves. 

PD methods by augmented Lagrangian functions. Nevertheless, as observed in our experiments, their 
practical performance is generally worse than the PD methods. 



Appendix 

In this appendix we provide an example to demonstrate that the l p -norm relaxation approaches for 
p G (0, 1] may fail to recover the sparse solution. 

Let p G (0,1] be arbitrarily chosen. Given any b 1 , b 2 G 5ft n , let b = b 1 + b 2 , a = |[ (& 1 ; fo 2 ) [1^ and 
A = [b 1 , b 2 , al n , al n ], where I n denotes the n x n identity matrix and \\x\\ p = l x i| p ) 1 ^ p f° r 

all x G K n . Consider the linear system Ax = b. It is easy to observe that this system has the sparse 
solution x s = (1, 1, 0, . . . , 0) T . However, x s cannot be recovered by solving the /p-"norm" regularization 
problem: 

/* = nun j/O) := ±\\Ax - b\\ 2 + u\\x\\ p \ 
for any v > 0. Indeed, let x = (0, 0, 6 1 /a, b 2 /a) T . Then, we have f(x s ) = 2 l l p v and f(x) = v, which 
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implies that f(x s ) > f(x) > f*. Thus, x s cannot be an optimal solution of the above problem for any 
v > 0. Moreover, the relative error between f(x s ) and /* is fairly large since 

- n/r > (/(**) - /(*))//(*) = - 1 > i. 

Therefore, the true sparse solution x s may not even be a "good" approximate solution to the Z p -"norm" 
regularization problem. 

References 

[1] O. Banerjee, L. E. Ghaoui, and A. D'Aspremont. Model selection through sparse maximum likeli- 
hood estimation. J. Mach. Learn. Res., 9:485-516, 2008. 

[2] J. A. Bilmes. Factored sparse inverse covariance matrices. International Conference on Acoustics, 
Speech and Signal processing, Washington, D.C., 1009-1012, 2000. 

[3] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2007. 

[4] E. G. Birgin, J. M. Martinez, and M. Raydan. Nonmonotone spectral projected gradient methods 
on convex sets. SIAM J. Optimiz, 4:1196-1211, 2000. 

[5] T. Blumensath and M. E. Davies. Iterative thresholding for sparse approximations. J. FOURIER 
ANAL. APPL., 14:629-654, 2008. 

[6] T. Blumensath and M. E. Davies. Iterative hard thresholding for compressed sensing. Appl. 
Comput. Harmon. Anal, 27(3): 265-274, 2009. 

[7] E. J. Candes and B. Recht. Exact matrix completion via convex optimization. Found. Comput. 
Math., 2009. 

[8] E. J. Candes, J. Romberg and T. Tao. Robust uncertainty principles: exact signal reconstruction 
from highly incomplete frequency information. IEEE T. Inform. Theory, 52:489-509, 2006. 

[9] R. Chartrand. Exact reconstruction of sparse signals via nonconvex minimization. IEEE Signal 
Proc. Let., 14:707-710, 2007. 

[10] S. Chen, D. Donoho and M. Saunders. Atomic decomposition by basis pursuit. SIAM J. Sci. 
Comput, 20:33-61, 1998. 

[11] X. Chen, F. Xu and Y. Ye. Lower bound theory of nonzero entries in solutions of fo-lp Minimization. 
SIAM J. Sci. Comput, 32:2832-2852, 2010. 

[12] X. Chen and W. Zhou. Convergence of reweighted l\ minimization algorithms and unique solution 
of truncated l p minimization. Technical report, 2010. 

[13] J. Claerbout and F. Muir. Robust modelling of erratic data. Geophysics, 38:826-844, 1973. 

[14] A. D'Aspremont, O. Banerjee and L. E. Ghaoui. First-order methods for sparse covariance selection. 
SIAM J. Matrix Anal. A., 30(l):56-66, 2008. 

[15] J. Dahl, L. Vandenberghe and V. Roychowdhury. Covariance selection for nonchordal graphs via 
chordal embedding. Optim. Method. Softw., 23(4):501-520, 2008. 



28 



[16] A. Dempster. Covariance selection. Biometrics, 28:157-175, 1978. 

[17] A. Dobra. Dependency networks for genome-wide data. Biostatistics, 8(1): 1-28, 2007. 

[18] A. Dobra, C. Hans, B. Jones, J. R. Nevins, G. Yao and M. West. Sparse graphical models for 
exploring gene expression data. J. Multivariate Anal, 90:196-212, 2004. 

[19] B. Efron, T. Hastie, I. Johnstone and R. Tibshirani. Least angle regression. Ann. Stat., 32(2):407- 
499, 2004. 

[20] M. A. T. Figueiredo, R. D. Nowak and S. J. Wright. Gradient projection for sparse reconstruction: 
application to compressed sensing and other inverse problems. IEEE J. Sel. Top. Signa.: Special 
Issue on Convex Optimization Methods for Signal Processing, l(4):586-598, 2007. 

[21] J. Friedman, T. Hastie and R. Tibshirani. Sparse inverse covariance estimation with the graphical 
lasso. Biostat., 9(3):432-441, 2008. 

[22] A. D. Gerson, L. C. Parra and P. Sajda. Cortical origins of response time variability during rapid 
discrimination of visual objects. Neuroimage, 28(2):342-353, 2005. 

[23] G. Golub and C. Van Loan. Matrix Computations, volume 13 of Studies in Applied Mathematics. 
John Hopkins University Press, third edition, 1996. 

[24] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. Loh, 
J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of 
cancer: class discovery and class prediction by expression monitoring. Science, 286:531-537, 1999. 

[25] E. T. Hale, W. Yin and Y. Zhang. Fixed-point continuation applied to compressed sensing: Im- 
plementation and numerical experiments. J. Comput. Math, 28(2): 170-194, 2010. 

[26] K. K. Herrity, A. C. Gilbert and J. A. Tropp. Sparse approximation via iterative thresholding. 
IEEE International Conference on Acoustics, Speech and Signal Processing, 2006. 

[27] S. J. Kim, K. Koh, M. Lustig, S. Boyd and D. Gorinevsky. An interior-point method for large-scale 
/i-regularized least squares. IEEE J. Sel. Top. Signa., 1(4):606-617, December 2007. 

[28] K. Koh, S. J. Kim and S. Boyd. An interior-point method for large-scale ^-regularized logistic 
regression. J. Mach. Learn. Res., 8:1519-1555, 2007. 

[29] Linear Algebra PACKage. Available at http://www.netlib.org/lapack/index.html 

[30] S. Lee, H. Lee, P. Abbeel and A. Ng. Efficient l\ -regularized logistic regression. In 21th National 
Conference on Artificial Intelligence (AAAI), 2006. 

[31] S. Levy and P. Fullagar. Reconstruction of a sparse spike train from a portion of its spectrum and 
application to high-resolution deconvolution. Geophysics, 46:1235-1243, 1981. 

[32] J. G. Liao and K. V. Chin. Logistic regression for disease classification using microarray data: 
model selection in a large p and small n case. Bioinformatics, 23(15):1945-1951, 2007. 

[33] J. Liu, S. Ji and J. Ye. SLEP: Sparse learning with efficient projections. Arizona State University, 
2009. Available at |http:// www . public . asu . edu/~j ye02 / Software / SLEP . 



29 



[34] L. Li and K. C. Toh. An inexact interior point method for l\ -regularized sparse covariance selection. 
Math. Program. Comput., 2:291-315, 2010. 

[35] Z. Lu. Smooth optimization approach for sparse covariance selection. SIAM J. Optimiz., 19(4):1807- 
1827, 2009. 

[36] Z. Lu. Adaptive first-order methods for general sparse inverse covariance selection. SIAM J. Matrix 
Anal. A., 31(4):2000-2016, 2010. 

[37] Z. Lu and Y. Zhang. Penalty decomposition methods for rank minimization. Technical report, 
Department of Mathematics, Simon Fraser University, Canada, 2010. 

[38] S. Mallat and Z. Zhang. Matching pursuits with time- frequency dictionaries. IEEE T. Image 
Process., 41(12):3397-3415, 1993. 

[39] A. Miller. Subset selection in regression. Chapman and Hall, London, 2002. 

[40] D. Newman, S. Hettich, C. Blake and C. Merz. UCI repository of machine learning databases, 
1998. Available at www.ics.uci.edu/~mlearn/MLRepository.html. 

[41] A. Y. Ng. Feature selection, l\ vs. I2 regularization, and rotational invariance. In Proceedings of 
the Twenty-First International Conference on Machine learning (ICML), 72-85, 2004. 

[42] M. Y. Park and T. Hastie. Regularization path algorithms for detecting gene interactions. Depart- 
ment of Statistics, Stanford University, 2006. 

[43] L. C. Parra, C. D. Spence, A. D. Gerson and P. Sajda. Recipes for the linear analysis of EEC 
Neuroimage, 28(2):326-341, 2005. 

[44] M. G. Philiastides and P. Sajda. Temporal characterization of the neural correlates of perceptual 
decision making in the human brain. Cereb. Cortex, 16(4):509-518, 2006. 

[45] J. Pittman, E. Huang, H. Dressman, C. F. Horng, S. H. Cheng, M. H. Tsou, C. M. Chen, A. Bild, 
E. S. Iversen, A. T. Huang, J. R. Nevins and M. West. Integrated modeling of clinical and gene 
expression information for personalized prediction of disease outcomes. P. Natl. Acad. Sci. USA, 
101(22):8431-8436, 2004. 

[46] B. Recht, M. Fazel, and P. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations 
via nuclear norm minimization. SIAM Rev., 2007. 

[47] A. Ruszczyhski. Nonlinear Optimization. Princeton University Press, 2006. 

[48] F. Santosa and W. Symes. Linear inversion of band-limited reflection histograms. SIAM J. Sci. 
Stat. Comp., 7:1307-1330, 1986. 

[49] J. Shi, W. Yin, S. Osher and P. Sajda. A fast hybrid algorithm for large-scale Zi-regularized logistic 
regression. J. Mach. Learn. Res., 11:713-741, 2010. 

[50] H. Taylor, S. Bank and J. McCoy. Deconvolution with the /i-norm. Geophysics, 44:39-52, 1979. 

[51] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc. B, 58(l):267-288, 
1996. 



30 



[52] J. A. Tropp. Greed is good: algorithmic results for sparse approximation. IEEE T. Inform. Theory, 
50(10):2231-2242, 2004. 

[53] J. Tropp. Just relax: Convex programming methods for identifying sparse signals. IEEE T. Inform. 
Theory, 51:1030-1051, 2006. 

[54] Y. Tsuruoka, J. McNaught, J. Tsujii and S. Ananiadou. Learning string similarity measures for 
gene/protein name dictionary look-up using logistic regression. Bioinformatics, 23(20):2768-2774, 
2007. 

[55] E. Van Den Berg and M. P. Friedlander. Probing the Pareto frontier for basis pursuit solutions. 
SI AM J. Sci. Comp., 31(2)890-912, 2008. 

[56] C. Wang, D. Sun and K. C. Toh. Solving log-determinant optimization problems by a Newton-CG 
proximal point algorithm. SI AM J. Optimiz., 20(6):2994-3013, 2010. 

[57] K. Y. Yeung, R. E. Bumgarner and A. E. Raftery. Bayesian model averaging: development of an 
improved multi-class, gene selection and classification tool for microarray data. Bioinformatics, 
21(10):2394-2402, 2005. 

[58] Y. Zhang, B. Dong and Z. Lu. Iq minimization for wavelet frame based image restoration. To 
appear in Math. Corn-put., 2011. 



31 



